Abstract: In this work, we present a family of architectures
for polar decoders using a reduced-complexity successive-cancellation
decoding algorithm that employs unrolling to achieve
extremely high throughput values while retaining moderate 
implementation complexity. The resulting fully-unrolled, deeply- 
pipelined architecture is capable of achieving a coded throughput 
in excess of 1 Tbps on a 65 nm ASIC at 500 MHz—three orders of 
magnitude greater than current state-of-the-art polar decoders. 
However, unrolled decoders are built for a specific, fixed code. 
Therefore we also present a new method to enable the use of 
multiple code lengths and rates in a fully-unrolled polar decoder 
architecture. This method leads to a length- and rate-flexible 
decoder while retaining the very high speed typical to unrolled 
decoders. The resulting decoders can decode a master polar code 
of a given rate and length, and several shorter codes of different 
rates and lengths. We present results for two versions of a multi- 
mode decoder supporting eight and ten different polar codes, 
respectively. Both are capable of a peak throughput of 25.6 Gbps. 
For each decoder, the energy efficiency for the longest supported 
polar code is shown to be 14.8 pJ/bit at 250 MHz and
8.8 pJ/bit at 500 MHz.

Index Terms —polar codes, ASIC, high throughput, multi- 
mode, unrolled architecture 

I. Introduction 

Polar codes have recently attracted considerable attention. They
are error-correcting codes with an explicit construction
that provably achieve the symmetric capacity of memoryless
channels with a low-complexity decoding algorithm: successive
cancellation (SC) [1]. As SC proceeds bit by bit, hardware
implementations suffered from low throughput and high latency
[2]-[5]. To overcome this, modified SC-based algorithms
were proposed [6]-[10]. The first hardware implementation
with a throughput greater than 1 Gbps was presented in [9]. 

In [11], a fully-unrolled deeply-pipelined hardware architecture
for polar decoders was proposed. Results showed a very
high throughput, greater than 200 Gbps on FPGA. However,
these architectures are built for a fixed polar code, i.e., the
code length or rate cannot be configured after designing
the decoder. This is a major drawback for most modern 
wireless communication applications that largely benefit from 
the support of multiple code lengths and rates. Furthermore, 
a deeply-pipelined architecture causes the area to grow very 
fast with the frame size. 

The goal of this paper is twofold. First, it is to generalize the 
unrolled architecture presented in [11] into a family of architectures
offering a flexible trade-off between throughput, area
and energy efficiency. The (1024, 512) fully-unrolled deeply-pipelined
polar decoder implementation of [11] is significantly
improved on all metrics. Second and most importantly, it 
is to show how an unrolled decoder built specifically for 
a polar code, of fixed length and rate, can be transformed 
into a multi-mode decoder supporting many codes of various 
lengths and rates. More specifically, we show how decoders 
for moderate-length polar codes contain decoders for many 
other shorter—but practical—polar codes of both high and low 
rates. The required hardware modifications are detailed, and 
ASIC synthesis and power estimations are provided for the 
65 nm CMOS technology from TSMC. Results show a peak 
information throughput greater than 15 Gbps at 250 MHz in 
4.29 mm² or greater than 20 Gbps at 500 MHz in 1.71 mm².
Latency is 2 µs and 650 ns for the former and latter, respectively.

The remainder of this paper starts with Section II by 
briefly reviewing polar codes, their construction and their 
representation. Section III provides the necessary background 
on the Fast Simplified Successive-Cancellation (Fast-SSC) 
decoding algorithm. Section IV describes the proposed family 
of unrolled hardware architectures. The concept, hardware 
modifications and other practical considerations related to the 
proposed multi-mode decoder are presented in Section V. 
Error-correction performance and implementation results for 
both dedicated and multi-mode decoders are provided in 
Section VI. Comparison against the fastest state-of-the-art 
polar decoder implementations in the literature is carried out 
in Section VI as well. Finally, a conclusion is drawn in 
Section VII. 

II. Polar Codes 

A. Construction 

Polar codes exploit the channel polarization phenomenon 
by which the probability of correctly estimating codeword 
bits tends to either 1 (completely reliable) or 0.5 (completely 
unreliable). These probabilities get closer to their limit as the 
code length increases when a recursive construction such as 
the one shown in Fig. 1 is used, where ⊕ represents a modulo-2
addition (XOR). Under successive-cancellation decoding,
polar codes were shown to achieve the symmetric capacity
of memoryless channels as their code length N → ∞ [1].

An (N, k) polar code has length N, carries k information
bits and is of rate R = k/N. The other N − k bits, called frozen bits,
are set to a predetermined value (usually zero) during the
encoding process. The grayed $u_i$'s, where $i \in \{0, 1, 2, 4\}$, on the
left-hand side of Fig. 1 correspond to the frozen bit locations of
a (16, 12) polar code.
Fig. 1: Graph representation of a (16,12) polar code.

Depending on the type of channel and its conditions, the optimal
location of the frozen bits varies and can be determined
using, for example, the method described in [12].

Encoding schemes for polar codes can be either non- 
systematic, as shown in Fig. 1, or systematic as discussed in 
[13]. Systematic polar codes offer a better bit-error rate (BER)
than their non-systematic counterparts while maintaining the
same frame-error rate (FER). A low-complexity systematic
encoding method was presented in [9] and proven to be correct 
in [14]. In this work, we use systematic polar codes. 

Both encoding types use the same generator matrix and, as
this matrix is built recursively, so are polar codes, i.e., a code
of length N is the concatenation of two codes of length N/2.
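
The recursive structure can be sketched in a few lines of Python. This is a minimal illustration of non-systematic encoding, assuming the natural (non-bit-reversed) bit ordering; the function name `polar_encode` and the information bits are illustrative choices, not part of the original work.

```python
def polar_encode(u):
    """Non-systematic polar encoding of a list of bits (length a power of two)."""
    n = len(u)
    if n == 1:
        return list(u)
    half = n // 2
    left = polar_encode(u[:half])   # constituent code of length N/2 (LHS)
    right = polar_encode(u[half:])  # constituent code of length N/2 (RHS)
    # Concatenation: the LHS half is XORed with the RHS half, the RHS half is kept.
    return [a ^ b for a, b in zip(left, right)] + right


# (16, 12) example from the text: positions 0, 1, 2 and 4 are frozen to zero.
FROZEN = {0, 1, 2, 4}
info_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]  # arbitrary illustrative data
it = iter(info_bits)
u = [0 if i in FROZEN else next(it) for i in range(16)]
x = polar_encode(u)
```

Because the generator matrix is its own inverse over GF(2), applying `polar_encode` to `x` returns `u` again, which gives a quick sanity check of the sketch.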


B. Representation 

Fig. 1 shows the graph representation of a (16,12) polar
code where the blue-dashed-circled v represents a concatenation
of two codes of length 4, a (4,1) polar code with a (4,3)
one, yielding an (8,4) polar code.

As polar codes are built recursively, it was proposed in 
[6] to represent them as binary trees. Fig. 2a illustrates 
such a representation, called decoder tree, equivalent to the 
graph of Fig. 1. In the decoder tree, white and black leaves 
represent frozen and information bits, respectively. Leaf nodes 
correspond to individual bits denoted $u_i$, where $0 \le i < N$,
and where the largest position index i is on the right hand 
side of the tree. Moving up in the decoder tree corresponds 
to the concatenation of constituent codes. For example, the 
concatenation operation circled in blue in Fig. 1 corresponds 
to the node labeled v in Fig. 2a. 

The left-hand-side (LHS) and right-hand-side (RHS) subtrees
rooted in the top node are polar codes of length N/2. In
the remainder of this paper, we designate the polar code, of 
length N, decoded by traversing the whole decoder tree as the 
master code and the various codes of lengths smaller than N 
as constituent codes. 

By definition, and like the master code, a constituent code
of length N/2 is in turn the concatenation of two polar codes
of length N/4, and so on until the leaf nodes are reached. As
such, the decoding of a polar code of length N can be seen
as the decoding of two constituent codes of length N/2, or of
four constituent codes of length N/4, etc. For example, and as
shown in the graph representation of Fig. 1, but better seen
in the decoder tree representation of Fig. 2a, a master code
of length 16 is the concatenation of two constituent codes of
length 8, or of four constituent codes of length 4, or of eight
constituent codes of length 2.

Fig. 2: Decoder trees for SC (a) and Fast-SSC (b) decoding
of a (16, 12) polar code.

It should be noted that sibling constituent codes with the 
same parent node share a special relation. Let us consider 
the polar code (constituent code) of length $N_v = 8$ taking
root in v, as illustrated in Fig. 2a, as the concatenation of two
constituent codes of length $N_v/2 = 4$. As that polar code gets
decoded, the estimated bits $\beta_l$ from its LHS constituent code
are required to compute the soft inputs $\alpha_r$ needed to decode
its RHS constituent code. Furthermore, once the estimated bits
$\beta_r$ are obtained by decoding the RHS constituent code, they
are combined with $\beta_l$ to form the bit-estimate vector $\beta_v$ for v.

III. The Fast-SSC Decoding Algorithm 

As mentioned above, a polar code is the concatenation of 
smaller constituent codes. Instead of using the SC algorithm 
on all constituent codes, the location of the frozen bits can 
be taken into account to use more efficient, lower-complexity
algorithms on some of these constituent codes [6], [9].

Fig. 2b shows the decoder tree equivalent to Fig. 2a, but 
when key parts of the Fast-SSC decoding algorithm [9] are 
used. The black node represents a rate-1 constituent code 
i.e. a polar code entirely composed of information bits. The 
green striped and orange cross-hatched nodes are repetition 
and single-parity-check (SPC) constituent codes, respectively. 
Gray nodes are codes of rate 0 < R < 1. It can be seen that
Fast-SSC visits fewer nodes in the decoder tree, significantly
decreasing the latency and increasing the throughput. It nevertheless
provides the same codeword estimates as SC, and hence offers
the same error-correction performance.
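
The pruning of the decoder tree can be sketched as follows. This is an illustrative Python fragment, not the authors' tool flow: the names `classify` and `prune` are hypothetical, node sizes are left unconstrained, and a length-2 node with one frozen bit is arbitrarily labeled as a repetition node.

```python
def classify(frozen):
    """Label a constituent code from its frozen-bit flags (True = frozen)."""
    n = len(frozen)
    if all(frozen):
        return "Rate-0"
    if not any(frozen):
        return "Rate-1"
    if n > 1 and all(frozen[:-1]):
        return "Rep"      # only the rightmost leaf carries information
    if n > 1 and frozen[0] and not any(frozen[1:]):
        return "SPC"      # only the leftmost leaf is frozen
    return "Rate-R"


def prune(frozen):
    """Return the pruned decoder tree as nested (label, length, children) tuples."""
    label = classify(frozen)
    if label != "Rate-R":
        return (label, len(frozen), ())
    half = len(frozen) // 2
    return (label, len(frozen), (prune(frozen[:half]), prune(frozen[half:])))


# (16, 12) code of Fig. 2: positions 0, 1, 2 and 4 are frozen.
flags = [i in {0, 1, 2, 4} for i in range(16)]
tree = prune(flags)
```

For this frozen set, the sketch yields a Rate-R root whose LHS child is a Rate-R node of length 8 (a repetition node followed by an SPC node, both of length 4) and whose RHS child is a Rate-1 node of length 8, which matches the node types described for Fig. 2b.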

While the proposed multi-mode unrolled decoders are independent
of the decoding algorithm, we briefly go over the
decoding operations mentioned in this paper.


Decoding Operations 

Three functions are inherited from the original SC algorithm,
and log-likelihood ratios (LLRs) are used for the soft
messages. Going down a left edge (colored blue in Fig. 2),
$\alpha_l$ is calculated with the min-sum approximation [3]

$$\alpha_l[i] = \operatorname{sgn}(\alpha_v[i]\,\alpha_v[i + N_v/2]) \min(|\alpha_v[i]|, |\alpha_v[i + N_v/2]|), \tag{1}$$

for $0 \le i < N_v/2$, where $\alpha_v$ is the input to the node and $N_v$
the width of $\alpha_v$. Going down a right edge (colored red in
Fig. 2), $\alpha_r$ is calculated with


$$\alpha_r[i] = \begin{cases} \alpha_v[i + N_v/2] + \alpha_v[i], & \text{when } \beta_l[i] = 0; \\ \alpha_v[i + N_v/2] - \alpha_v[i], & \text{otherwise,} \end{cases} \tag{2}$$

for $0 \le i < N_v/2$, where $\beta_l$ is the bit estimate from the LHS
child.
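
As an illustration, (1) and (2) can be written directly in Python. This is a floating-point sketch with hypothetical function names `f_op` and `g_op`; a hardware implementation would operate on quantized LLRs.

```python
def f_op(alpha_v):
    """Min-sum F operation of (1): LLRs sent to the left child."""
    half = len(alpha_v) // 2
    out = []
    for a, b in zip(alpha_v[:half], alpha_v[half:]):
        sign = -1 if (a < 0) != (b < 0) else 1
        out.append(sign * min(abs(a), abs(b)))
    return out


def g_op(alpha_v, beta_l):
    """G operation of (2): LLRs sent to the right child, given the LHS bit estimates."""
    half = len(alpha_v) // 2
    return [b + a if bit == 0 else b - a
            for a, b, bit in zip(alpha_v[:half], alpha_v[half:], beta_l)]


alpha_v = [3, -1, -2, 4]
alpha_l = f_op(alpha_v)               # [-2, -1]
alpha_r = g_op(alpha_v, [1, 1])       # [-5, 5]
alpha_r_zero = g_op(alpha_v, [0, 0])  # [1, 3], the case where the LHS estimate is all-zero
```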

Once a leaf node is reached, the bit estimate is set to zero
when it corresponds to a frozen bit location. Otherwise, it
is calculated by threshold detection on $\alpha_v$. Going back up a
RHS edge, the bit estimates from both children are combined
to generate the node's bit-estimate vector

$$\beta_v[i] = \begin{cases} \beta_l[i] \oplus \beta_r[i], & \text{when } i < N_v/2; \\ \beta_r[i - N_v/2], & \text{when } N_v/2 \le i < N_v, \end{cases} \tag{3}$$

where $\oplus$ is modulo-2 addition (XOR).
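
Taken together, the three operations give a compact recursive description of SC decoding. The sketch below is a software illustration only, with a hypothetical function name `sc_decode`; it returns the codeword estimate of the node, as in (3), and says nothing about how the operations are scheduled in hardware.

```python
def sc_decode(alpha_v, frozen):
    """Recursive SC decoding; returns the bit-estimate vector of the node.

    alpha_v: list of LLRs (length a power of two).
    frozen:  list of booleans, True where the leaf is a frozen bit.
    """
    n = len(alpha_v)
    if n == 1:  # leaf: frozen bits are zero, others use threshold detection
        return [0] if frozen[0] or alpha_v[0] >= 0 else [1]
    half = n // 2
    lo, hi = alpha_v[:half], alpha_v[half:]
    # F operation, eq. (1)
    alpha_l = [(-1 if (a < 0) != (b < 0) else 1) * min(abs(a), abs(b))
               for a, b in zip(lo, hi)]
    beta_l = sc_decode(alpha_l, frozen[:half])
    # G operation, eq. (2)
    alpha_r = [b + a if bit == 0 else b - a for a, b, bit in zip(lo, hi, beta_l)]
    beta_r = sc_decode(alpha_r, frozen[half:])
    # Combine operation, eq. (3)
    return [bl ^ br for bl, br in zip(beta_l, beta_r)] + beta_r


# Length-4 code with only the first bit frozen; positive LLRs favor bit 0.
frozen = [True, False, False, False]
clean = sc_decode([1, 1, -1, -1], frozen)     # noiseless codeword 0011
noisy = sc_decode([1, 1, -1, 0.2], frozen)    # last LLR has the wrong sign
```

In this small example the weak, wrongly signed LLR is corrected, so both calls return the same codeword estimate.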

In [6], the Simplified SC (SSC) algorithm is introduced 
where decoder tree nodes are split into three categories: Rate- 
0, Rate-1, and Rate-R nodes. 

1) Rate-0 Nodes: are subtrees whose leaf nodes all correspond
to frozen bits. We do not need to use a decoding
algorithm on such a subtree as the exact decision, by definition,
is always the all-zero vector.

2) Rate-1 Nodes: are subtrees where all leaf nodes carry 
information bits, none are frozen. The maximum-likelihood 
decoding rule for these nodes is to take a hard decision on the 
input LLRs: 


$$\beta_v[i] = \begin{cases} 0, & \text{when } \alpha_v[i] \ge 0; \\ 1, & \text{otherwise,} \end{cases} \tag{4}$$

for $0 \le i < N_v$. With a fixed-point representation, this
operation amounts to copying the most significant bit of the 
input LLRs. 
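
The equivalence between (4) and a sign-bit copy can be checked with a short Python sketch. It assumes a Q-bit two's-complement representation, which the text does not specify; the helper names are illustrative.

```python
def rate1_decode(alpha_v):
    """Hard decision of (4) on a list of LLRs."""
    return [0 if a >= 0 else 1 for a in alpha_v]


def to_twos_complement(value, q):
    """Store a signed integer as a Q-bit two's-complement word (assumed format)."""
    return value & ((1 << q) - 1)


def rate1_decode_fixed_point(words, q):
    """Copy the most significant bit of each Q-bit word."""
    return [(w >> (q - 1)) & 1 for w in words]


Q = 5
llrs = [5, -3, 0, -1]
words = [to_twos_complement(v, Q) for v in llrs]
bits_float = rate1_decode(llrs)
bits_fixed = rate1_decode_fixed_point(words, Q)
```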

3) Rate-R Nodes: Lastly, Rate-R nodes, where 0 < R < 1, 
are subtrees such that leaf nodes are a mix of information 
and frozen bits. As shown in [9], instead of always using the 
SC or SSC algorithm, some Rate-R nodes corresponding to 
specific frozen-bit locations can be decoded using algorithms 
with lower complexity and latency. The subset of nodes and
operations from [9] used in our proposed family of architectures
is briefly reviewed in the following.

4) F, G and G0R Operations: The F and G operations are
among the functions used in the conventional SC decoding
algorithm and are calculated using (1) and (2), respectively.

G0R is a special case of the G operation where the left child
is a frozen node, i.e., $\beta_l$ is known a priori to be the all-zero
vector of length $N_v/2$.

5) Combine and C0R Operations: As defined by (3), the
Combine operation generates the bit-estimate vector. A C0R
operation is a special case of the Combine operation where
the LHS constituent code, $\beta_l$, is a Rate-0 node.

6) Repetition Node: In this node, all leaf nodes are frozen 
bits, with the exception of the node that corresponds to the 
rightmost (RHS) leaf in the tree. At encoding time, the only information
bit gets repeated over the $N_v$ outputs. The information bit
can be estimated by using threshold detection over the sum of
the input LLRs $\alpha_v$:

$$\beta_v = \begin{cases} 0, & \text{when } \sum_{i=0}^{N_v-1} \alpha_v[i] \ge 0; \\ 1, & \text{otherwise,} \end{cases}$$

where $\beta_v$ gets replicated $N_v$ times to create the bit-estimate
vector.
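
A minimal Python sketch of the repetition-node rule, with a hypothetical function name `rep_decode` and floating-point LLRs, could read:

```python
def rep_decode(alpha_v):
    """Decode a repetition node: threshold the sum of the LLRs and replicate the bit."""
    bit = 0 if sum(alpha_v) >= 0 else 1
    return [bit] * len(alpha_v)


zeros = rep_decode([1.5, -0.5, -0.25, 0.5])  # sum is 1.25, so the bit is 0
ones = rep_decode([-2, 1, -1, 0.5])          # sum is -1.5, so the bit is 1
```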

7) Single-parity-check (SPC) Node: An SPC node is a node 
such that all leaf nodes are information bits with the exception 
of the node at the least significant position (LHS leaf in a tree). 
To decode an SPC code, we start by calculating the parity of
the hard decisions taken on the input LLRs:

$$\text{parity} = \bigoplus_{i=0}^{N_v-1} \beta_v[i], \quad \text{where } \beta_v[i] = \begin{cases} 0, & \text{when } \alpha_v[i] \ge 0; \\ 1, & \text{otherwise.} \end{cases}$$

The estimated bit vector is then generated by reusing the
$\beta_v$ calculated above unless the parity constraint is not satisfied,
i.e., the parity is different from zero. In that case, the estimated bit
corresponding to the input with the smallest LLR magnitude
is flipped:

$$\beta_v[i] = \beta_v[i] \oplus 1, \quad \text{where } i = \arg\min_j(|\alpha_v[j]|).$$
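
The SPC rule can be sketched in the same way. The function name `spc_decode` is illustrative, and ties in the minimum search are resolved in favor of the lowest index, a choice the text does not specify.

```python
def spc_decode(alpha_v):
    """Decode an SPC node: hard decisions, then flip the least reliable bit if needed."""
    beta = [0 if a >= 0 else 1 for a in alpha_v]
    parity = 0
    for b in beta:
        parity ^= b
    if parity != 0:
        # Index of the input with the smallest LLR magnitude (first one on ties).
        j = min(range(len(alpha_v)), key=lambda i: abs(alpha_v[i]))
        beta[j] ^= 1
    return beta


fixed = spc_decode([2.0, -0.3, 1.5, 4.0])   # odd parity: bit 1 is flipped back to 0
kept = spc_decode([-2.0, -3.0, 1.0, 0.5])   # even parity: hard decisions are kept
```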

Our proposed decoders borrow from the Fast-SSC algorithm
in that they use the specialized nodes and operations described
above to reduce the decoding latency. However, the family of
architectures we propose greatly differs from the processor-like
architecture of [9]. Moreover, [9] proposes hybrid node
types combining the ones above in order to further reduce the
decoding latency. With the exception of the RepSPC node—a 
specialized node decoding a Repetition code concatenated with 
an SPC code—that is used in one of the implementations, we 
do not use those hybrid nodes in this paper. 


IV. Unrolled Architectures 

In an unrolled decoder, each and every operation required 
is instantiated so that data can flow through the decoder with 
minimal control. 

The idea of fully unrolling a decoder has previously been 
applied to decoders for other families of error-correcting 
codes. Notably, in [15], [16], the authors propose a fully- 
unrolled deeply-pipelined decoder for an LDPC code. Polar 
codes are more suitable to unrolling as they do not feature a 
complex interleaver like LDPC codes. 


A. Deeply Pipelined 

In a deeply-pipelined architecture, a new frame is loaded 
into the decoder at every clock cycle. Therefore, a new 
estimated codeword is output at each clock cycle as each 
register is active at each rising edge of the clock (no enable 
signal required). In that architecture, at any point in time, 
there are as many frames being decoded as there are pipeline 
stages. This leads to a very high throughput at the cost of 
high memory requirements. Some pipeline stage paths do not 
contain any processing logic, only memory. They are added to 
ensure that the different messages remain synchronized. These 
added memories yield register chains, or SRAM blocks. 

Fig. 3: Fully-unrolled deeply-pipelined decoder for an (8, 4)
polar code. Clock signals omitted for clarity.

Fig. 4: Fully-unrolled partially-pipelined decoder for an (8, 4)
polar code with I = 2. Clock signals omitted for clarity.


Fig. 3 shows a fully-unrolled and deeply-pipelined decoder 
for an (8,4) polar code. The $\alpha$ and $\beta$ blocks illustrated in light
blue are registers storing LLRs and bit estimates, respectively.
White blocks are the functions described in Section III and
dotted registers are regular registers that will be referred to
in the next section. Among the registers, two are needed to
retain the channel LLRs, denoted $\alpha_c$ in the figure, during the
2nd and 3rd clock cycles. Similarly, two registers have to be
added for the persistence of the hard-decision vector $\beta_1$ over
the 4th and 5th clock cycles. Such unrolled architectures for
polar decoders were described in [11].

The information throughput can be defined as PfR bps,
where P is the width of the output bus in bits, f is the
execution frequency in Hz and R is the code rate. In this
paper, P is assumed to be equal to the code length N. The
decoding latency depends on the frozen bit locations and the
constrained maximum width for all processing nodes, but is
less than $N \log_2 N$. In our experiments, with the operations
and optimizations described below, the decoding latency never
exceeded N/2 clock cycles.

B. Partially Pipelined 

In a deeply-pipelined architecture, a significant amount of 
memory is required for data persistence. That memory quickly 
increases with the code length N. Instead of loading a new 
frame into the decoder and estimating a new codeword at every 
cycle, we propose a compromise where the unrolled decoder 
can be partially pipelined to reduce the required memory. Let 
I be the initiation interval, where a new estimated codeword is
output every I clock cycles. The case where I = 1 translates
to a deeply-pipelined architecture. We note that the interval
only affects the memory, not the computational elements, in 
the decoder. 

Setting I > 1 leads to a significant reduction in the memory
requirements. An initiation interval of I translates to an
effective required register chain length of $\lceil L/I \rceil$ instead of L,
where L is the length of the register chain. Using I = 2 leads
to a reduction of approximately 50% in the amount of memory required for
that section of the circuit. This reduction applies to all register
chains present in the decoder. A partially-pipelined decoder
with I = 2 can be obtained for an (8,4) polar code by removing
the dotted registers in Fig. 3, leading to the decoder of Fig. 4.
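
The effect of the initiation interval on a single register chain can be illustrated numerically. The chain length of 10 used below is a hypothetical value chosen for the example, and `effective_chain_length` is an illustrative helper.

```python
def effective_chain_length(chain_length, interval):
    """Ceiling of L / I, computed with integer arithmetic."""
    return -(-chain_length // interval)


L = 10  # hypothetical register chain length in the deeply-pipelined decoder
lengths = {interval: effective_chain_length(L, interval) for interval in (1, 2, 3, 4)}
saving_with_two = 1 - lengths[2] / lengths[1]
```

For this chain, I = 2 halves the number of registers, while I = 3 and I = 4 leave 4 and 3 registers, respectively.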

The initiation interval I can be increased further in order
to reduce the memory requirements, but only up to a certain
limit. We call that limit the maximum initiation interval $I_{\max}$,
and its value depends on the decoder tree. By definition, the
longest register chain in a fully-unrolled decoder is used to
preserve the channel LLRs $\alpha_c$. Hence, the maximum initiation
interval corresponds to the number of clock cycles required for
the decoder to reach the last operation in the decoder tree that
requires $\alpha_c$: $G_v$, the operation calculated when going down the
right edge linking the root node to its right-hand-side child.
Once that $G_v$ operation is completed, $\alpha_c$ is no longer needed
and can be overwritten. As an example, consider the (8,4)
polar decoder illustrated in Fig. 4. As soon as the switch to
the right-hand side of the decoder tree occurs, i.e., when G is
traversed, the register containing the channel LLRs $\alpha_c$ can be
updated with the LLRs for the new frame without affecting the
remaining operations for the current frame. Thus the maximum
initiation interval, $I_{\max}$, for that decoder is 3.

The resulting coded and information throughputs are

$$T_C = \frac{N f}{I} \quad \text{and} \quad T_I = \frac{N f R}{I}, \tag{5}$$

respectively, where I is the initiation interval. Note that this
new definition can also be used for the deeply-pipelined architecture.
The decoding latency remains unchanged compared to
the deeply-pipelined architecture.
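
As a numerical illustration of (5), the following Python sketch evaluates both throughputs. The helper name `throughput` is illustrative; pairing N = 1024 with I = 20 and f = 500 MHz, and N = 2048 with I = 1 and f = 500 MHz, are assumptions made for the example.

```python
def throughput(n, f_hz, rate, interval=1):
    """Return (coded, information) throughput in bit/s according to (5)."""
    coded = n * f_hz / interval
    return coded, coded * rate


# Assumed operating point: (1024, 853) master code, I = 20, 500 MHz.
coded_1024, info_1024 = throughput(1024, 500e6, 853 / 1024, interval=20)

# Assumed deeply-pipelined case (I = 1) for a code of length 2048 at 500 MHz.
coded_2048, _ = throughput(2048, 500e6, 1365 / 2048, interval=1)
```

Under these assumptions the coded throughput is 25.6 Gbps in the first case, with an information throughput of about 21.3 Gbps, and slightly above 1 Tbps in the second case.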

Fig. 5 shows a fully-unrolled partially-pipelined decoder 
with an initiation interval I = 2 for the (16,12) polar code of
Fig. 2b. Some control and routing logic was added to make 
it multi-mode as detailed in the next section. The “&” blocks 
are bit-vector joining operators. 

The partially-pipelined architecture requires a more elaborate
controller than the deeply-pipelined architecture. For
both fully- and partially-pipelined architectures, the controller
generates a done signal to indicate that a new estimated
codeword is available at the output. For the partially-pipelined
architecture, the controller also contains a counter with a maximum
value of (I − 1), which generates the I enable signals
for the registers. Each enable signal is asserted only when the
counter reaches its associated value in [0, I − 1]; otherwise it remains
deasserted. Each register uses the enable signal corresponding
to its location in the pipeline modulo I. As an example, let
us consider the decoder of Fig. 5, where I is set to 2. In that
example, two enable signals are created and a simple counter
alternates between 0 and 1. The registers storing the channel
LLRs $\alpha_c$ are enabled when the counter is equal to 0 because
their inputs reside on the even (0, 2, 4 and 6) stages of the
pipeline. On the other hand, the two registers holding the $\alpha_1$
LLRs are enabled when the counter is equal to 1 because their
inputs are on odd (1 and 3) stages. The other registers follow
the same rule.
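
The enable rule can be modeled behaviorally in Python. This is only a sketch of the rule stated above, with hypothetical names; it does not describe the actual controller circuit.

```python
def enable_index(stage, interval):
    """Enable signal used by a register whose input sits on pipeline stage `stage`."""
    return stage % interval


def enabled_stages(stages, interval, cycle):
    """Stages whose registers are enabled when the free-running counter is cycle mod I."""
    counter = cycle % interval
    return [s for s in stages if enable_index(s, interval) == counter]


I = 2
stages = range(7)                          # stages 0..6, as in the example above
even = enabled_stages(stages, I, cycle=0)  # counter equals 0
odd = enabled_stages(stages, I, cycle=1)   # counter equals 1
```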

The required memory resources could be further reduced
by performing the decoding operations in a combinational
manner, i.e., by removing all the registers except the ones
labeled $\alpha_c$ and $\beta_c$, as in [17]. However, the resulting reachable
frequency is too low for the desired throughput level.

Fig. 5: Unrolled partially-pipelined decoder for a (16,12) polar code with initiation interval I = 2. Clock, flip-flop enable and
multiplexer select signals are omitted for clarity.

C. Replacing Register Chains with SRAM Blocks 

As the code length N grows, long register chains start to 
appear in the decoder, especially with a smaller I. In order 
to reduce the number of registers required, register chains can 
be converted into SRAM blocks. 

Consider the register chain of length 4 used for the persistence
of the channel LLRs $\alpha_c$ in the fully-unrolled partially-pipelined
(16,12) decoder shown in the top row of Fig. 5. Preserving
the first register, the remaining 3 registers in that chain
can be replaced by a dual-port SRAM block with a width of
16Q bits, where Q is the number of quantization bits, and a depth of 3,
along with a controller to generate the appropriate read and
write addresses. Similar to a circular buffer, if the addresses
are generated to increase every clock cycle, the write address
is set to be one position ahead of the read address.

SRAM blocks can replace register chains in a deeply-pipelined
architecture as well. In both architectures, the SRAM
block depth has to be equal to or greater than the register chain
length minus one.

V. Multi-mode Unrolled Decoders 

It can be noted that an unrolled decoder for a polar code 
of length N is composed of unrolled decoders for two polar 
codes of length N/2, which are each composed of unrolled
decoders for two polar codes of length N/4, and so on. Thus,
by adding some control and routing logic, it is possible to 
directly feed and read data from the unrolled decoders for 
constituent codes of length smaller than N. The end result is a 
multi-mode decoder supporting frames of various lengths and 
code rates. 

A. Hardware Modifications to the Unrolled Decoders 

Consider the decoder tree shown in Fig. 2b along with its 
unrolled implementation as illustrated in Fig. 5. In Fig. 2b, the 
constituent code taking root in v is an (8,4) polar code. Its 
corresponding decoder can be directly employed by placing 
the 8 channel LLRs into $\alpha_0^7$ and by selecting the bottom
input of the multiplexer $m_1$ illustrated in Fig. 5. Its estimated
codeword is retrieved by reading the output of the Combine
block feeding the $\beta_4$ register, i.e., by selecting the top and
bottom inputs from $m_4$ and $m_5$, respectively, and by reading
the 8 least-significant bits from $\beta_0^{15}$. Similarly, still in Fig. 5,
the decoders for the repetition and SPC constituent codes
can be fed via the $m_2$ and $m_3$ multiplexers and their output
eventually recovered from the output of the Rep and SPC
blocks, respectively.

Although not illustrated in Figs. 3, 4 or 5, the proposed 
unrolled decoders feature a minimal controller. While not 
mandatory, the functionality of this controller is altered to 
better accommodate the use of multiple polar codes. Two look-up
tables (LUTs) are added. One LUT stores the decoding
latency, in clock cycles, of each code. It serves as a stopping
criterion to generate the done signal. The other LUT stores the
clock cycle “value” $I_{\text{start}}$ at which the enable-signal generator
circuit should start. Each non-master code may start at a
value $(I_{\text{start}} \bmod I) \ne 0$. In such cases, using the unaltered
controller would result in the waste of $(I_{\text{start}} \bmod I)$ clock
cycles. This can be significant for short codes, especially with
large values of I. For example, without these changes, for the
implementation with a master code of length 1024 and I = 20
presented in Section VI below, the latency for the (128,96)
polar code would increase by 20% as $(I_{\text{start}} \bmod I) = 17$ and
the decoding latency is 82 clock cycles.
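
The figure quoted for the (128,96) code can be reproduced with a small calculation. In this sketch, `latency_overhead` is a hypothetical helper, and the start value of 17 is an assumed stand-in: only its remainder modulo I = 20 is given in the text.

```python
def latency_overhead(i_start, interval, latency):
    """Wasted cycles and relative latency increase with an unaltered controller."""
    wasted = i_start % interval
    return wasted, wasted / latency


# (128, 96) mode of the decoder with a master code of length 1024 and I = 20.
wasted, relative = latency_overhead(i_start=17, interval=20, latency=82)
```

Here 17 wasted cycles on top of an 82-cycle latency correspond to an increase of roughly 20%.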

Lastly, the modified controller also generates the multiplexer 
select signals, allowing proper data routing, based on the 
selected mode. 

B. On the Construction of the Master Code 

Conventional approaches construct polar codes for a given 
channel type and condition. In this work, many of the constituent
codes contained within a master code are not only used
internally to detect and correct errors, they are used separately 
as well. Therefore, we propose to assemble a master code 
using two optimized constituent codes in order to increase 
the number of optimized polar codes available. Doing so, the 
number of information bits, or the code rate, of the second 
largest supported codes can be selected. In the following, a 
master code of length 2048 is constructed by concatenating 
two constituent codes of length 1024. The LHS and RHS 

Fig. 6: Error-correction performance of two (2048,1365) polar
codes with different constructions: optimized with [12] and assembled.


constituent codes are chosen to have a rate of 1/2 and of
5/6, respectively. As a result, the assembled master code has
rate 2/3. The location of the frozen bits in the master code
is dictated by its constituent codes. Note that the constituent
code with the lowest rate is put on the left, and the one with
the highest rate on the right, to minimize the coding loss
associated with a non-optimized polar code.
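
Assembling a master code amounts to merging two frozen-bit sets, with the RHS indices offset by half the master code length. The Python sketch below illustrates this on the small (16,12) code of Fig. 1 rather than on the length-2048 code, whose frozen sets are not listed here; `assemble_frozen` is a hypothetical helper.

```python
def assemble_frozen(frozen_lhs, frozen_rhs, half_length):
    """Frozen-bit set of a master code built from two constituent codes."""
    return set(frozen_lhs) | {i + half_length for i in frozen_rhs}


# Small illustration: an (8, 4) LHS code and a rate-1 (8, 8) RHS code.
master_frozen = assemble_frozen({0, 1, 2, 4}, set(), half_length=8)
master_k = 16 - len(master_frozen)

# Rate of the master code assembled from (1024, 512) and (1024, 853) codes.
assembled_rate = (512 + 853) / 2048
```

The small example reproduces the frozen set of the (16,12) code, and the assembled rate of 1365/2048 is within 0.0002 of 2/3.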

Fig. 6 shows both the frame-error rate (left) and the bit-error
rate (right) of two different (2048,1365) polar codes.
The black-solid curve is the performance of a polar code constructed
using the method described in [12] for Eb/N0 = 4 dB.
The dashed-red curve is for the (2048,1365) code constructed by
concatenating (assembling) a (1024,512) polar code and a
(1024, 853) polar code. Both polar codes of length 1024 were
also constructed using the method of [12] for Eb/N0 values of
2.5 and 5 dB, respectively.

From the figure, it can be seen that constructing an optimized
polar code of length 2048 with rate 2/3 results in a
coding gain of approximately 0.17 dB at a FER of $10^{-3}$, a
FER appropriate for certain applications, over one assembled
from two shorter polar codes of length 1024. The gap
increases with the signal-to-noise ratio, reaching 0.24 dB
at a FER of $10^{-4}$. Looking at the BER curves, it can be
observed that the gap is much narrower. Compared to that of
the assembled master code, the optimized polar code shows a
coding gain of 0.07 dB at a BER of $10^{-5}$.

C. About Constituent Codes: frozen bit locations, rate and 
practicality 

The location of the frozen bits in non-optimized constituent 
codes is dictated by their parent code. In other words, if 
the master code of length N has been assembled from two 
optimized (constituent) polar codes of length N/2 as suggested
in the previous section, the shorter optimized codes of length
N/2 determine the location of the frozen bits in their respective
constituent codes of length < N/2. Otherwise, the master code
dictates the frozen bit locations for all constituent codes. 

Assuming that the decoding algorithm takes advantage of
the a priori knowledge of these locations, the code rate and
frozen bit locations of constituent codes cannot be changed at
execution time. However, there are many constituent codes to
choose from and code shortening can be used [18] to create
more, e.g. in order to obtain a specific number of information
bits or code rate.

Fig. 7: Error-correction performance of the four constituent
codes of length 128 with a rate of approximately 5/6 contained
in the proposed (2048,1365) master code: (128,100), (128,102),
(128,107) and (128,108). Horizontal axes: Eb/N0 (dB).

Because of the polarization phenomenon, given any two sibling
constituent codes, the code rate of the LHS one is always
lower than that of the RHS one for a properly constructed 
polar code [14]. That property plays to our advantage as, in 
many wireless applications, it is desirable to offer a variety of 
codes of both high and low rates. 

It should be noted that not all constituent codes within a 
master code are of practical use e.g. codes of very high rate 
offer negligible coding gain over an uncoded communication. 
For example, among the four constituent codes of length 4 
included in the (16,12) polar code illustrated in Fig. 2a, 
two of them are rate-1 constituent codes. Using them would 
be equivalent to uncoded communication. Moreover, among 
constituent codes of the same length, many codes may have 
a similar number of information bits with little to no error- 
correction performance difference in the region of interest. 
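
The constituent codes contained in a master code can be enumerated directly from its frozen-bit set. The sketch below does so for the (16,12) code of Fig. 2a; `constituent_codes` is an illustrative helper.

```python
def constituent_codes(frozen, n, length):
    """(length, k) of every constituent code of the given length, from left to right."""
    return [(length, sum(1 for i in range(start, start + length) if i not in frozen))
            for start in range(0, n, length)]


FROZEN = {0, 1, 2, 4}  # (16, 12) code of Fig. 1
codes_of_length_8 = constituent_codes(FROZEN, 16, 8)
codes_of_length_4 = constituent_codes(FROZEN, 16, 4)
rate_one = [c for c in codes_of_length_4 if c[1] == c[0]]
```

The enumeration gives (8,4) and (8,8) codes at length 8, and (4,1), (4,3), (4,4) and (4,4) codes at length 4, the last two being the rate-1 codes mentioned above.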

Fig. 7 shows the frame-error rate of all four constituent
codes of length 128 with a rate of approximately 5/6 that are
contained within the proposed (2048,1365) master code. It can
be seen that, even at such a short length, at a FER of $10^{-5}$ the
gap between both extremes is under 0.5 dB. Among those
constituent codes, only the (128,108) was selected for the 
implementation presented in Section VI. It is beneficial to limit 
the number of codes supported in a practical implementation 
of a multi-mode decoder in order to minimize routing circuitry. 


D. Latency and Throughput Considerations 

If a decoding algorithm taking advantage of the a priori 
knowledge of the frozen bit locations is used in the unrolled 
decoder, such as Fast-SSC [9], the latency will vary even 
among constituent codes of the same length. However, the 
coded throughput will not. The coded throughput of an unrolled
decoder for a polar code of length N will be twice that
of a constituent code of length N/2, which, in turn, is double that of
a constituent code of length N/4, and so on. The coded and
information throughput are defined by (5).

In wireless communication standards where multiple code 
lengths and rates are supported, the peak information throughput
is typically achieved with the longest code that has both
the greatest latency and highest code rate. It is not mandatory 
to reproduce this with our proposed method, but it can be done 
if considered desirable. It is the example that we provide in 
the implementation section of this paper. 

Another possible scenario would be to use a low-rate master
code, e.g. R = 1/3, that is more powerful in terms of error-
correction performance. The resulting multi-mode decoder
would reach its peak information throughput with the longest
constituent code of length N/2 that has the highest code rate, a
code with a significantly lower decoding latency than that of
the master code.

VI. Implementation and Results 

In this section, we start by presenting results for dedicated 
unrolled decoders: showing the effect of the initiation interval, 
the code length and the code rate on unrolled decoders. Then, 
we present results for two implementations of our proposed 
multi-mode unrolled decoders. For the latter, we had the 
objective of building decoders with a throughput in the vicinity 
of 20 Gbps. 

The multi-mode decoder examples are built around
(1024,853) and (2048,1365) master codes. In the following,
the former is referred to as the decoder supporting a maximum
code length N_max of 1024 and the latter as the decoder with
N_max = 2048. A total of ten polar codes were selected for the
decoder supporting codes of lengths up to 2048. The other
decoder with N_max = 1024 has eight modes corresponding to
a subset of the ten polar codes supported by the bigger decoder.
The master codes used in this section are the same as those
used in Section V-B.

For the decoder with N_max = 1024, the Repetition and SPC
nodes were constrained to a maximum size N_v of 8 and 4,
respectively. For the decoder with N_max = 2048, we found it
more beneficial to lower the execution frequency and increase
the maximum sizes of the Repetition and SPC nodes to 16 and
8, respectively. Additionally, the decoder with N_max = 2048
also uses RepSPC [9] nodes to reduce latency.

A. Methodology 

In our experiments, decoders are built with sufficient memory
to accommodate storing an extra frame at the input, and
to preserve an estimated codeword at the output. As a result,
the next frame can be loaded while a frame is being decoded.
Similarly, an estimated codeword can be read while the next
frame is being decoded. We define decoding latency to include
the time required to load channel LLRs, decode a frame and
offload the estimated codeword.

The quantization used was determined by running fixed- 
point simulations with bit-true models of the decoders. A 
smaller number of bits is used to store the channel LLRs 
compared to that of the other LLRs used in the decoder. All 
LLRs use 2’s complement representation and share the same 



Fig. 8: Effect of quantization on the error-correction performance
of a (1024, 512) polar code.


TABLE I: Decoders for a (1024, 512) polar code with various
initiation intervals I. The clock is set to 500 MHz and the
latency is 728 ns.

   I    Tot. Area  Log. Area  Mem. Area    T/P     Power    Energy
          (mm²)      (mm²)      (mm²)     (Gbps)   (mW)    (pJ/bit)
   1     12.369      0.60      11.75      512.0    3,830      7.5
   4      4.921      0.64       4.24      128.0    1,060      8.3
  50      1.232      0.65       0.56       10.2      107     10.5
 167      0.998      0.63       0.34        3.1       62     20.0


number of fractional bits. We denote quantization as Q_i.Q_c.Q_f,
where Q_c is the total number of bits used to store a channel LLR,
Q_i is the total number of bits used to store internal LLRs
and Q_f is the number of fractional bits in both. Q_i and Q_c both
include the sign bit. Fig. 8 shows that, for a (1024, 512) polar
code modulated with BPSK and transmitted over an AWGN
channel, using Q_i.Q_c.Q_f equal to 5.4.0 results in a 0.1 dB
performance degradation at a bit-error rate of 10 r '. Thus we 
used that quantization for the hardware results. 

ASIC synthesis results are for the 65 nm CMOS GP
technology from TSMC and are obtained with Cadence RTL
Compiler. Unless indicated otherwise, all results are for the
worst-case library at a supply voltage of 0.72 V with an operating
temperature of 125 °C. Power consumption estimations are
also obtained from Cadence RTL Compiler; switching activity
is derived from simulation vectors. Only registers were used
for memory due to the lack of access to an SRAM compiler.

B. Dedicated Decoders: Effect of the Initiation Interval 

In this section, we explore the effect of the initiation interval 
on the implementation of the fully-unrolled architecture. The 
decoders are built for the same (1024, 512) polar code used 
in [11], although many improvements were made since the 
publication of that work. Regardless of the initiation interval, 
all decoders use 5.4.0 quantization and have a decoding latency 
of 364 clock cycles. 

Table I shows the results for various initiation intervals.
Besides the effect on throughput, increasing the initiation
interval causes a significant reduction in memory requirements
without significantly affecting combinational logic. Since area
is largely dominated by registers, increasing the initiation
interval has a large effect on the total area. For example, using
I = 50 results in an area that is more than 10 times smaller,
at the cost of a throughput that is 50 times lower. Table I
also shows that reducing the area has a direct effect on the
estimated power consumption, which drops significantly as I increases.

As expected, increasing the initiation interval I offers a 
diminishing return as it gets closer to the maximum, 167 for 
the example (1024, 512) code. Also, as I is increased, the 
energy efficiency is reduced. 
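
The trade-off in Table I can be reproduced with a few lines of arithmetic. The sketch below assumes that the coded throughput is N·f/I (which matches the throughput column of Table I) and that the energy per bit is the power divided by the coded throughput. The function and variable names are illustrative and do not come from the paper.

```python
# Assumed model: coded throughput = N * f / I, energy per bit = power / throughput.
N = 1024        # code length in bits
F_CLK = 500e6   # clock frequency in Hz

# Initiation interval I -> (total area in mm^2, power in mW), copied from Table I.
TABLE_I = {1: (12.369, 3830), 4: (4.921, 1060), 50: (1.232, 107), 167: (0.998, 62)}


def coded_throughput_gbps(n, f_hz, interval):
    """Coded throughput in Gbps when a new frame enters every `interval` cycles."""
    return n * f_hz / interval / 1e9


def energy_pj_per_bit(power_mw, throughput_gbps):
    """mW divided by Gbps is pJ/bit (1e-3 W / 1e9 bit/s = 1e-12 J/bit)."""
    return power_mw / throughput_gbps


results = {}
area_at_i1 = TABLE_I[1][0]
for interval, (area, power) in TABLE_I.items():
    tp = coded_throughput_gbps(N, F_CLK, interval)
    # Store throughput, energy per bit, and area reduction relative to I = 1.
    results[interval] = (tp, energy_pj_per_bit(power, tp), area_at_i1 / area)
    print(interval, round(tp, 2), round(results[interval][1], 1),
          round(results[interval][2], 1))
```

Small differences from the energy column of Table I are expected, since the table appears to use rounded throughput values.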

C. Dedicated Decoders: Effect of the Code Length and Rate 

Results for other polar codes are presented in this section 
where we show the effect of the code length and rate on 
performance and resource usage. 

TABLE II: Deeply-pipelined decoders for polar codes of
various lengths with rate R = 1/2. The clock is set to 500 MHz.

   N    Tot. Area  Log. Area  Mem. Area  Latency    T/P     Power    Energy
          (mm²)      (mm²)      (mm²)     (ns)    (Gbps)    (mW)    (pJ/bit)
  128     0.349      0.05       0.29       152       64      105      1.6
  256     1.121      0.12       0.99       268      128      342      2.7
  512     3.413      0.27       3.14       408      256    1,050      4.0
 1024    12.369      0.60      11.75       728      512    3,830      7.5
 2048    43.541      1.32      42.16     1,304    1,024   13,526     13.2


Tables II and III show the effect of the code length on
area, decoding latency, coded throughput, power consumption,
and energy efficiency for polar codes of short to moderate
lengths. Table II contains results for the fully-unrolled deeply-
pipelined architecture (I = 1) and the code rate R is fixed to
1/2 for all polar codes. Table III contains results for the fully-
unrolled partially-pipelined architecture where the maximum
initiation interval (I_max) is used and the code rate R is 5/6.

As shown in Table II, with a deeply-pipelined architecture,
logic area usage grows almost as N log_2 N, whereas memory
area is closer to being quadratic in code length N. The
logic area required for a deeply-pipelined unrolled decoder
implemented in 65 nm ASIC technology can be approximated
with an accuracy greater than 98% using C · N log_2 N, where
the constant C is set to 1/17,000. For comparison, the logic area
of tree-based SC decoders is O(N) while the other state-of-the-
art partially-parallel architectures have a fixed logic area that does
not depend on the code length.
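
As a rough check of the N log_2 N model, the constant C can be estimated from the logic-area column of Table II. The sketch below uses a least-squares fit through the origin, which is our choice of method and not necessarily the one used by the authors; with the rounded table values it gives 1/C close to 17,000.

```python
import math

# Logic area in mm^2 for each code length N, copied from Table II.
LOGIC_AREA = {128: 0.05, 256: 0.12, 512: 0.27, 1024: 0.60, 2048: 1.32}

# Model: area ~= C * N * log2(N). Least-squares fit through the origin.
x = [n * math.log2(n) for n in LOGIC_AREA]
y = list(LOGIC_AREA.values())
c_fit = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
inv_c = 1.0 / c_fit

print(f"C is about 1/{inv_c:.0f}")
for n, area in LOGIC_AREA.items():
    model = c_fit * n * math.log2(n)
    print(n, area, round(model, 3))
```

The shortest code deviates the most, which is plausibly an effect of the two-digit rounding of its tabulated area.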

Curve fitting shows that the memory area is quadratic in the
code length N. Let the memory area be defined by a + bN + cN^2;
setting a = -0.249, b = 2.466 × 10^{-3} and c = 8.912 × 10^{-6} results
in a standard error of 0.1839.
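
The quadratic fit can be checked against the memory-area column of Table II. The sketch below takes the coefficients as a = -0.249, b = 2.466 × 10^{-3} and c = 8.912 × 10^{-6}; the negative sign on a is an assumption, chosen because it is the value that reproduces the reported standard error. The standard error is computed here as the residual standard error with three fitted parameters, which is also an assumption about the definition used.

```python
import math

# Memory area in mm^2 for each code length N, copied from Table II.
MEM_AREA = {128: 0.29, 256: 0.99, 512: 3.14, 1024: 11.75, 2048: 42.16}

# Assumed coefficients of the model a + b*N + c*N^2 (sign of `a` is an assumption).
A, B, C = -0.249, 2.466e-3, 8.912e-6


def mem_area_model(n):
    """Quadratic memory-area model in mm^2."""
    return A + B * n + C * n * n


residuals = [area - mem_area_model(n) for n, area in MEM_AREA.items()]
dof = len(MEM_AREA) - 3  # five data points, three fitted parameters
std_err = math.sqrt(sum(r * r for r in residuals) / dof)

print(f"residual standard error: {std_err:.4f}")
```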

As shown in Table II, throughput exceeding 1 Tbps and
500 Gbps can be achieved with a deeply-pipelined decoder
for polar codes of length 2048 and 1024, respectively. As the
memory area grows quadratically with the code length, the
amount of energy required to decode a bit increases with the
code length. The decoder for the (4096,2048) polar code could
not be synthesized on our server due to insufficient memory.

For a partially-pipelined architecture with I_max, both the
memory and total area scale linearly with N. The power
consumption is shown to almost scale linearly as well. The


TABLE III: Partially-pipelined decoders with initiation interval
set to I_max for polar codes of various lengths with rate R = 5/6.
The clock is set to 500 MHz.

   N      I    Tot. Area  Mem. Area  Latency    T/P     Power    Energy
                 (mm²)      (mm²)     (μs)    (Gbps)    (mW)    (pJ/bit)
 1024    206     0.793      0.28      0.646     2.5       51      20.5
 2048    338     1.763      0.61      0.888     3.0      108      35.6
 4096    665     4.248      1.44      1.732     3.1      251      81.5


results of Table III also show that it was possible to synthesize 
ASIC decoders for larger code lengths than what was possible 
with a deeply-pipelined architecture. 


TABLE IV: Deeply-pipelined decoders for polar codes of
length N = 1024 with common rates. The clock is set to
500 MHz and the throughput is 512 Gbps.

   R    Tot. Area  Mem. Area  Latency  Latency   Power    Energy
          (mm²)      (mm²)     (CCs)    (ns)     (mW)    (pJ/bit)
  1/2    12.369      11.75      364      727     3,830      7.5
  2/3    13.049      12.45      326      651     4,041      6.2
  3/4    15.676      15.05      373      745     4,865      6.5
  5/6    14.657      14.05      323      645     4,549      7.1


The effect of using different code rates for a polar code of
length N = 1024 is shown in Table IV. We note that the higher
rate codes do not have noticeably lower latency compared to
the rate-1/2 code, contrary to what was observed in [9]. This
is due to limiting the width of SPC nodes to N_SPC = 4 in this
work, whereas it was left unbounded in the others. The result
is that long SPC codes are implemented as trees whose leftmost
child is a width-4 SPC node and the others are all rate-1
nodes. Thus, for each additional stage (log_2 N_v - log_2 N_SPC) of
an SPC code of length N_v > N_SPC, four nodes with a total
latency of 3 clock cycles are required: F, G followed by I,
and Combine. This brings the total latency of decoding a long
SPC code to 3(log_2 N_v - log_2 N_SPC) + 1 clock cycles compared
to ⌈N_v/P⌉ + 4 in [9], where P is the number of LLRs that can
be read simultaneously (256 was a typical value for P in [9]).
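
To make the comparison concrete, the sketch below evaluates the two SPC latency expressions discussed above, taking them to be 3(log_2 N_v - log_2 N_SPC) + 1 for this work and ceil(N_v / P) + 4 for [9]. The function names are illustrative, and the inputs are assumed to be powers of two.

```python
import math


def spc_latency_this_work(n_v: int, n_spc: int = 4) -> int:
    """Clock cycles for an SPC code of length n_v when SPC nodes are capped at n_spc."""
    if n_v < n_spc or n_v & (n_v - 1) or n_spc & (n_spc - 1):
        raise ValueError("expects powers of two with n_v >= n_spc")
    # Number of extra tree stages above the native width-n_spc SPC node.
    extra_stages = (n_v.bit_length() - 1) - (n_spc.bit_length() - 1)
    return 3 * extra_stages + 1


def spc_latency_ref9(n_v: int, p: int = 256) -> int:
    """Clock cycles for the same SPC code using the expression quoted for [9]."""
    return math.ceil(n_v / p) + 4


for n_v in (4, 8, 16, 64, 256):
    print(n_v, spc_latency_this_work(n_v), spc_latency_ref9(n_v))
```

Under these expressions the bounded-width approach is faster only for very short SPC codes; for longer ones its latency grows with every additional stage, which is consistent with the high-rate codes of Table IV not being noticeably faster.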

From Table IV, it can be seen that varying the rate does
not affect the logic area, which remains almost constant at
approximately 0.61 mm². Memory, in the form of registers,
dominates the decoder area. Therefore, the estimated power
consumption scales according to the memory area.

D. Deeply-pipelined SC Decoders 

To decode a frame, an SC decoder needs to load a frame,
visit all 2N - 2 edges of the decoder tree twice and store
the estimated codeword. A deeply-pipelined SC decoder for
a (128, 64) polar code has an area of 2.17 mm², a latency
of 510 clock cycles, and a power consumption of 677 mW.
These values are 6.2, 6.7, and 6.4 times as much as their
counterparts of the deeply-pipelined Fast-SSC decoder reported
in Table II. These results indicate that deeply-pipelined
SC decoders will be limited to very short polar codes, and
that alternative algorithms and architectures will yield more
practical implementations.
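
The SC figures quoted above can be cross-checked against the N = 128 row of Table II. The sketch below assumes that the decoder tree has 2N - 2 edges, that each edge is traversed twice at one clock cycle per traversal, and that loading and storing take one cycle each; these assumptions reproduce the reported 510 cycles. Names are illustrative.

```python
F_CLK_MHZ = 500  # clock frequency used for Table II


def sc_latency_cycles(n: int) -> int:
    """Load (1) + two visits of each of the 2n - 2 tree edges + store (1)."""
    edges = 2 * n - 2
    return 1 + 2 * edges + 1


# Deeply-pipelined SC decoder for the (128, 64) code, values quoted in the text.
sc_area_mm2, sc_power_mw = 2.17, 677
sc_cycles = sc_latency_cycles(128)

# Deeply-pipelined Fast-SSC decoder, N = 128 row of Table II.
fast_area_mm2, fast_latency_ns, fast_power_mw = 0.349, 152, 105
fast_cycles = fast_latency_ns * F_CLK_MHZ / 1000  # ns * MHz / 1000 = cycles

ratios = (sc_area_mm2 / fast_area_mm2,
          sc_cycles / fast_cycles,
          sc_power_mw / fast_power_mw)
print(sc_cycles, [round(r, 1) for r in ratios])
```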
Fig. 9: Error-correction performance of the polar codes.

E. Multi-mode Decoders: Error-correction Performance 

Fig. 9 shows the frame-error rate performance of ten different
polar codes. The decoder with N_max = 2048 supports all ten
illustrated polar codes whereas the decoder with N_max = 1024
supports all polar codes but the two shown as dotted curves.
All simulations are generated using random codewords modulated
with binary phase-shift keying and transmitted over an
additive white Gaussian noise channel.

It can be seen from the figure that the error-correction
performance of the supported polar codes varies greatly. As
expected, for codes of the same length, the codes with
the lowest code rates perform significantly better than their
higher-rate counterparts. For example, at a FER of 10^{-4}, the
performance of the (512,363) polar code is almost 3 dB better
than that of the (512,490) code.

While the error-correction performance plays a role in the
selection of a code, the latency and throughput are also
important considerations. As will be shown in the following
section, the ten selected polar codes differ considerably
in that regard as well.

F. Multi-mode Decoders: Latency and Throughput 

Table V shows the latency and information throughput for
both decoders with N_max ∈ {1024, 2048}. To reduce the area
and latency while retaining the same throughput, the initiation
interval I can be increased along with the clock frequency (5).

Table V assumes that both decoders have an initiation interval
of 20, as used in the section below, with clock frequencies of
500 MHz and 250 MHz for the decoders with N_max = 1024
and N_max = 2048, respectively. While their master codes differ,
both decoders feature a peak information throughput in the
vicinity of 20 Gbps. For the decoder with the smallest N_max,
the seven other polar codes have an information throughput
in the multi-gigabit per second range with the exception of
the shortest and lowest-rate constituent code. That (128,39)
constituent code still has an information throughput close
to 1 Gbps. The decoder with N_max = 2048 offers multi-gigabit
throughput for most of the supported polar codes. The
minimum information throughput is also with the (128,39)
polar code at approximately 500 Mbps.
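
The information throughput values discussed above follow from the number of information bits delivered per initiation interval. The sketch below assumes that the information throughput is k·f/I for an (N, k) code, which is our reading of (5) and is consistent with the peak and minimum values quoted in the text; the function name is illustrative.

```python
def info_throughput_gbps(k: int, f_clk_hz: float, interval: int) -> float:
    """Information throughput in Gbps: k information bits every `interval` cycles."""
    return k * f_clk_hz / interval / 1e9


# (decoder N_max, clock in Hz) -> list of (N, k) codes to evaluate; I = 20 for both.
CASES = {
    (1024, 500e6): [(1024, 853), (512, 490), (128, 39)],
    (2048, 250e6): [(2048, 1365), (1024, 853), (128, 39)],
}

for (n_max, f_clk), codes in CASES.items():
    for n, k in codes:
        tp = info_throughput_gbps(k, f_clk, 20)
        print(f"N_max={n_max}: ({n},{k}) -> {tp:.2f} Gbps")
```

Because a shorter constituent code still occupies one initiation interval, its throughput scales with k alone, which is why the (128,39) code sits near 1 Gbps and 500 Mbps on the two decoders.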


TABLE V: Information throughput and latency for the multi-
mode unrolled polar decoders based on the (1024,853) and
(2048,1365) master codes, with N_max of 1024 and 2048,
respectively.

                        Info. T/P (Gbps)    Latency (CCs)     Latency (ns)
 Code (N, k)    Rate    N_max    N_max      N_max   N_max     N_max   N_max
                        = 1024   = 2048     = 1024  = 2048    = 1024  = 2048
 (2048, 1365)   2/3       -      17.1         -      503        -     2,012
 (1024, 853)    5/6     21.3     10.7        323     236       646      944
 (1024, 512)    1/2       -       6.4         -      265        -     1,060
 (512, 490)     19/20   12.3      6.2         95      75       190      300
 (512, 363)     7/10     9.1      4.5        226     159       452      636
 (256, 228)     9/10     5.7      2.6         86      61       172      244
 (256, 135)     1/2      3.4      1.7        138      96       276      384
 (128, 108)     5/6      2.7      1.4         54      40       108      160
 (128, 96)      3/4      2.4      1.2         82      52       164      208
 (128, 39)      1/3      0.98     0.49        54      42       108      168


In terms of latency, the decoder with N_max = 1024 requires
646 ns to decode its longest supported code. The latency for
all the other codes supported by that decoder is under 500 ns.
Even with its additional dedicated node and relaxed maximum
size constraint on the Repetition and SPC nodes, the decoder
with N_max = 2048 has greater latency overall because of its
lower clock frequency. For example, its latency is 2.01 μs,
944 ns and 1.06 μs for the (2048,1365), (1024,853) and
(1024,512) polar codes, respectively.

Using the same nodes and constraints as for N_max = 1024,
the N_max = 2048 decoder would allow for greater clock
frequencies. While 689 clock cycles would be required to decode
the longest polar code instead of 503, a clock of 500 MHz
would be achievable, effectively reducing the latency from
2.01 μs to 1.38 μs and doubling the throughput. However,
this reduction comes at the cost of much greater area and an
estimated power consumption close to 1 W.

G. Comparing with the State of the Art 

Table VI shows the synthesis results along with power
consumption estimations for the two implementations of the
proposed multi-mode unrolled decoder. The work in the first
two columns is for the decoder with N_max = 1024, based
on the (1024,853) master code. It was synthesized for clock
frequencies of 500 MHz and 650 MHz, with initiation intervals
I of 20 and 26, respectively. Our work shown in the third
and fourth columns is for the decoders with N_max = 2048, built
from the assembled (2048,1365) polar code. These decoders
have an initiation interval I of 20 or 28, with lower clock
frequencies of 250 MHz and 350 MHz, respectively. For comparison
with other works, the same table also includes results
for a dedicated partially-pipelined decoder for a (1024,512)
polar code.

The four fastest polar decoder implementations from the
literature are also included for comparison along with normalized
area results. For consistency, only the largest polar
code supported by each of our proposed multi-mode unrolled
decoders is used and the coded throughput, as opposed to the
information one, is compared to match what was done in most
of the other works.

TABLE VI: Comparison with state-of-the-art polar decoders.

                          Multi-mode   Multi-mode   Multi-mode   Multi-mode   Dedicated    [19]         [20]°        [17]         [8]
Algorithm                 Fast-SSC     Fast-SSC     Fast-SSC     Fast-SSC     Fast-SSC     Fast-SSC     BP           SC           2-bit SC
Technology                65 nm        65 nm        65 nm        65 nm        65 nm        65 nm        65 nm        90 nm        45 nm
N_max                     1024         1024         2048         2048         1024         1024         1024         1024         1024
Code                      (1024,853)   (1024,853)   (2048,1365)  (2048,1365)  (1024,512)   (1024,512)   (1024,512)   (1024,k)     (1024,512)
Init. interval (I)        20           26           20           28           20           -            -            -            -
Supply (V)                0.72         1.0          0.72         1.0          1.0          1.0          1.0          1.3          N/A
Oper. temp. (°C)          125          25           125          25           25           25           25           N/A          N/A
Area (mm²)                1.71         1.44         4.29         3.58         1.68         0.69         1.48         3.21         N/A
Area @65 nm (mm²)         1.71         1.44         4.29         3.58         1.68         0.69         1.48         1.68         0.4
Frequency (MHz)           500          650          250          350          500          600          300          2.5          750
Latency (μs)              0.65         0.50         2.01         1.44         0.73         0.27         50           0.39         1.02
Coded T/P (Gbps)          25.6         25.6         25.6         25.6         25.6         3.7          4.7 @ 4 dB   2.56         1.0
Sust. coded T/P (Gbps)    25.6         25.6         25.6         25.6         25.6         3.7          2.0          2.56         1.0
Area eff. (Gbps/mm²)      15.42        17.75        5.97         7.16         15.27        5.40         3.18 @ 4 dB  0.80         N/A
Power (mW)                226          546          379          740          386          215          478          191          N/A
Energy (pJ/bit)           8.8          21.3         14.8         28.9         15.1         57.7         102.1        74.5         N/A

° Measurement results.

From Table VI, it can be seen that the area of the proposed
decoders with N_max = 1024 is similar to that of the BP
decoder of [20] as well as the normalized area of the unrolled
SC decoder from [17]. However, their area is from 2.1 to 2.5
times greater than that of [19]. Comparing the multi-mode
decoders, the area for the decoder with N_max = 2048 is over
twice that of the ones with N_max = 1024; however, the master
code for the former has twice the length of the latter and
the decoder supports two more modes.

All proposed decoders have a coded throughput that is an
order of magnitude greater than that of the other works. Latency
is one to two orders of magnitude lower than that of the
BP decoder. Comparing against the SC decoder of [17], the
latency is 1.7 or 3.7 times greater for decoders with an N_max
of 1024 and 2048, respectively. It should be noted that the
decoder of [17] supports codes of any rate, whereas the proposed
multi-mode decoders support a limited number of code rates.

The latency of the proposed decoders is higher than that of the
programmable Fast-SSC decoder of [19]. This is due to greater
limitations on the specialized repetition and SPC decoders.
The decoder in [19] limits repetition decoders to a maximum
length of 32, compared to 8 or 16 in this work, and does not
place limits on the SPC decoders.

Finally, among the decoders with N_max = 1024 implemented
in 65 nm with a 1 V power supply and operating at 25 °C, our
proposed implementation offers the greatest area and energy
efficiency. The proposed multi-mode decoder exhibits 3.3 and
5.6 times better area efficiency than the decoders of [19] and
[20], respectively. The energy efficiency is estimated to be
2.7 and 4.8 times higher compared to that of the same two
decoders from the literature.

Recently, a List-based multi-mode decoder was proposed in
[21], where the definition of the word “multi-mode” differs
greatly from ours: in our work, it is used to indicate that
the decoder is capable of decoding codes of varying length
and rate, whereas in [21], a “mode” indicates the level of
parallelism in the decoder. The decoder of [21] is capable of
decoding 4 paths in parallel by implementing 4 processing
units. It can be configured to either do SC-based decoding of
4 frames or List-based decoding. For the latter, two list sizes
L are supported. If L = 2, 2 frames are decoded in parallel;
otherwise, if L = 4, only 1 frame is decoded at a time.


H. I/O Bounded Decoding 

The family of unrolled architectures that we proposed
requires tremendous throughput at the input of the decoder,
especially with a deeply-pipelined architecture. For example,
if a quantization of Q_c = 4 bits is used for channel LLRs, for
every estimated bit, 4 times as many bits have to be loaded
into the decoder. In other words, the total data rate is 5 times
that of the output. This can be a significant challenge on both
FPGA and ASIC. If only for that reason, partially-pipelined
architectures are certainly more attractive.
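
To give a sense of scale, the sketch below applies this reasoning to two coded throughput values reported earlier: 512 Gbps for the deeply-pipelined (1024, 512) decoder and 25.6 Gbps for the multi-mode decoders. It assumes one hard-decision output bit per Q_c-bit channel LLR; the function name is illustrative.

```python
def io_rates_gbps(coded_tp_gbps: float, q_c: int = 4):
    """Return (input rate, total I/O rate) in Gbps for a given coded output rate."""
    input_rate = q_c * coded_tp_gbps        # q_c bits loaded per estimated bit
    total_rate = input_rate + coded_tp_gbps  # input plus output
    return input_rate, total_rate


for coded_tp in (512.0, 25.6):
    inp, total = io_rates_gbps(coded_tp)
    print(f"coded {coded_tp} Gbps -> input {inp:.1f} Gbps, total {total:.1f} Gbps")
```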

VII. Conclusion 

In this paper we presented a family of architectures for fully-
unrolled polar decoders. With an initiation interval that can be
adjusted, these architectures make it possible to find a trade-off
between area and achievable throughput without affecting
decoding latency. We showed that a fully-unrolled deeply-
pipelined decoder implemented on an ASIC could achieve a 
throughput up to three orders of magnitude greater than the 
state of the art. Furthermore, we presented a new method to 
transform an unrolled architecture into a multi-mode decoder 
supporting various polar code lengths and rates. We showed 
that a master code can be assembled from two optimized 
polar codes of smaller length, with desired code rates, without 
sacrificing too much coding gain. We provided results for 
two decoders, one built for a (1024,853) master code and 
the other for a longer (2048,1365) polar code. Both decoders 
support from seven to nine other practical codes. On 65 nm
ASIC, they were shown to have a peak throughput greater than
25 Gbps. One has a worst-case latency of 2 μs at 250 MHz
and an energy efficiency of 14.8 pJ/bit. The other has a worst-
case latency of 646 ns at 500 MHz and an energy efficiency
of 8.8 pJ/bit. Both implementation examples show that, with
their great throughput and support for codes of various lengths
and rates, multi-mode unrolled polar decoders are promising
candidates for future wireless communication standards.